feat(compiler+recorder): contenteditable typing capture, trace_viewer view fixes, SVG-clickable highlight, and dual default-alias config#66

Merged
softpudding merged 9 commits into main from feat/compiler-default-alias on Apr 25, 2026
Summary

This branch lands several related changes uncovered while debugging a real
recording where text typed into Yuque's contenteditable document body was
silently dropped between recorder and compiler. Fixing that one bug surfaced
gaps at every layer of the recording → compile → replay pipeline, which this
branch addresses end-to-end. It also brings forward a few earlier
extension/highlight fixes and a server-side LLM-config split that were
already on the branch.

The 8 commits group into four themes:

1. Recorder + compiler: capture and surface contenteditable typing

Rich-text editors (Yuque's Lake editor, ProseMirror, Slate, Lexical, TipTap)
intercept keystrokes at keydown + preventDefault and apply edits via
their own DOM model, so native input events never fire on the body. The
old extension input listener also explicitly filtered to
HTMLInputElement/HTMLTextAreaElement only — so typing into a Yuque doc
produced zero input events in the trace. Even when surrounding
keydown Enter events captured an HTML snapshot, the compiler-side
trace_viewer truncated input values at 80 chars in the events view and
omitted them entirely from normalized_steps, so user instructions typed at
the end of a body (after URL paste) never reached the LLM.

Recorder (extension/src/content/index.ts, commit 8f11aa6):

  • Broaden the input listener filter to also accept contenteditables.
  • New beforeinput listener for contenteditable-only — captures keystrokes
    before the editor intercepts, with the DOM snapshot deferred to a
    microtask so the serialized value reflects post-mutation state.
  • New isContentEditableElement() / getContentEditableText() helpers in
    serializeElement() populate value/valueLength/isContentEditable: true
    for contenteditable targets.
  • coalesce_typing_events upstream (fa4913b) folds runs of consecutive
    typing events on the same element into one — keeps event_detail working
    on any absorbed index but stops a 100-keystroke burst from burying every
    click in noise.

Compiler (server/core/compiler_agent.py + server/core/workflow_compiler.py):

  • _format_value_with_tail(value, head=200, tail=200) renders long input
    values with both ends visible and a …(N more chars; use event_detail)…
    middle marker in the events view (replaces hard 80-char head-only
    truncation).
  • _handle_normalized_steps surfaces a per-anchor
    field <selector> final_value="…" line for [form] steps, picking the
    latest snapshot in event order (paste-then-trim, backspace-heavy edits,
    and clear-and-rewrite all break the longest-wins heuristic).
  • _extract_input_value falls back to _extract_visible_text_from_html on
    element.html (then to element.text) when value is missing — recovers
    contenteditable text from older traces too. Sensitive fields still bypass
    the fallback.
  • system_prompt_compiler.j2 is updated in the agent-sdk repo (commit
    32e6edba there) to describe the new render markers and the
    final_value line semantics for contenteditable bodies. This branch
    picks up the change through the venv copy of the pinned SDK, following
    the local_vendor cleanup.
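
As a rough illustration of the head-and-tail rendering described for _format_value_with_tail, here is a minimal sketch. The function name without the leading underscore, the exact boundary condition, and the spacing around the marker are assumptions; only the head=200/tail=200 defaults and the marker wording come from the description above.

```python
def format_value_with_tail(value: str, head: int = 200, tail: int = 200) -> str:
    """Keep both ends of a long value visible and elide the middle.

    Sketch only: the real helper may differ in boundary handling.
    """
    if len(value) <= head + tail:
        # Short enough to show in full.
        return value
    hidden = len(value) - head - tail
    # value[-0:] would return the whole string, so guard tail == 0.
    tail_part = value[-tail:] if tail > 0 else ""
    return f"{value[:head]} …({hidden} more chars; use event_detail)… {tail_part}"


# Small head/tail so the effect is visible on a short string.
print(format_value_with_tail("abcdefghijklmnopqrstuvwxyz", head=3, tail=3))
# abc …(20 more chars; use event_detail)… xyz
```

With head-only truncation, instructions typed after a pasted URL fall off the end; keeping the tail is what lets them reach the LLM.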

Tests (new):

  • server/tests/unit/test_workflow_compiler_contenteditable.py (8 tests) —
    HTML fallback, malformed input, script/style skipping, sensitive-field
    refusal, plain-text fallback.
  • server/tests/unit/test_compiler_agent_value_view.py (6 tests) —
    _format_value_with_tail edges + _handle_normalized_steps form-step
    final-value rendering with paste-then-trim coverage.
  • server/tests/unit/test_coalesce_typing_events.py (from fa4913b).
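
The fallback path these tests exercise might look roughly like the sketch below. It uses the standard-library html.parser; the dictionary keys ("value", "html", "text", "isSensitive") and the un-prefixed function names are assumptions for illustration, not the actual trace schema or the real helpers.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_visible_text_from_html(html: str) -> str:
    parser = _VisibleText()
    parser.feed(html)  # HTMLParser is lenient with malformed markup
    parser.close()
    return " ".join(parser.parts)


def extract_input_value(element: dict) -> Optional[str]:
    """Prefer the recorded value; fall back to html, then to text."""
    value = element.get("value")
    if value:
        return value
    if element.get("isSensitive"):  # hypothetical flag: never use the fallback
        return None
    html = element.get("html")
    if html:
        text = extract_visible_text_from_html(html)
        if text:
            return text
    return element.get("text") or None


print(extract_input_value({"html": "<div>Hello <b>world</b><style>p{}</style></div>"}))
# Hello world
```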

2. Extension fixes already on the branch

  • 83d27c0 fix(extension): type via CDP Input.dispatchKeyEvent per character
  • 274c37a fix(highlight): detect SVG graphics elements as clickable

274c37a is the primary cause of the bidirectional movement seen in §4
below: it was needed for mapquest_nearby_pins (where <circle> pins were
invisible to highlight scan) and yields +22 summed score across the four
models on that test alone, but produces a small rubric side-effect on
bluebook_simple (where the agent now likes the SVG heart from the search
card, skipping the note_open rubric criterion).

3. Compiler-default-alias config + UI surface

  • 224d025 feat: separate compiler-agent default LLM from general agent default
    server/core/llm_config.py, server/api/routes/config.py, frontend
    surface and tests. Lets the compiler use a stronger model than the runtime
    agent (e.g. plus for compile, flash for execute).
  • 3ff942b chore: remove stray local_vendor/ directory
    removes a stale checkin of the agent-sdk system prompt that diverged from
    the upstream copy.

4. Eval scaffolding + reports

  • eval/routine_eval/fixtures/github-trending-contenteditable-question/
    — new fixture pinning the regression. intent_note.txt (1 line),
    raw_intention.md (history + ground truth), expectations.yaml (required
    position-vs-identity question; forbids "what text did the user type"; the
    expected_routine_content block requires the routine to mention all three
    agent-investigation prompts).
  • eval/evaluation_report.json — refreshed benchmark from the 2026-04-24
    full eval (105/140 PASSED, 75.0%).
  • eval/routine_eval/evaluate_routine_compile.py — namespaces the
    canonical regression report by compile_alias so a multi-model loop
    produces compile_evaluation_report_<alias>.json per run instead of every
    run clobbering the same file.
  • 4 per-model canonical compile-eval reports
    (qwen3{5,6}{plus,flash}-fast.json).
  • skill/claude/ob-routines/SKILL.md — fixes for the recording skill that
    surfaced while running this branch end-to-end: tmux launch keeps the
    window alive via exec zsh, Monitor template now detects pane-gone,
    [compiler:saved] is verified via list_routines.py, and the gate
    reasoning is explicitly Claude's responsibility to write out as
    user-visible text.

Eval results (2026-04-24 full run, 4 × -fast models, 35 tests × 4 = 140 runs)

| Model | Pass rate | Score / Max | Δ score vs main |
|---|---|---|---|
| qwen3.5-plus | 85.7% | 274.3 / 304.8 | −1.9 |
| qwen3.5-flash | 60.0% | 232.2 / 304.8 | −10.9 |
| qwen3.6-plus | 74.3% | 251.4 / 304.8 | −11.0 |
| qwen3.6-flash | 80.0% | 274.5 / 304.8 | +1.5 |

  • Raw total delta: −21.8 / 1219.2 (−1.8%)
  • Infra-adjusted delta (subtracting two confirmed 0.0-score
    400 Bad Request infra kills + one LLMBadRequestError mid-flow):
    −4.8 / 1219.2 (−0.4%) — within stochastic range.
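
The headline percentages can be cross-checked with a few lines of arithmetic. The per-model pass counts below are inferred from the reported pass rates over 35 tests; they are not stated in the report itself.

```python
MAX_PER_MODEL = 304.8
MODELS = 4
TESTS_PER_MODEL = 35

max_total = round(MAX_PER_MODEL * MODELS, 1)  # 1219.2


def pct(delta: float, total: float) -> float:
    """Express a score delta as a percentage of the maximum total."""
    return round(100 * delta / total, 1)


# Pass counts inferred from the pass-rate column (assumption).
passes = {"qwen3.5-plus": 30, "qwen3.5-flash": 21, "qwen3.6-plus": 26, "qwen3.6-flash": 28}
rates = {m: round(100 * n / TESTS_PER_MODEL, 1) for m, n in passes.items()}

print(rates)                     # 85.7, 60.0, 74.3, 80.0
print(sum(passes.values()))      # 105 of 140 runs
print(pct(-21.8, max_total))     # -1.8 (raw delta)
print(pct(-4.8, max_total))      # -0.4 (infra-adjusted delta)
```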

Bidirectional movement (the dominant pattern):

  • 4 of the top-5 improvements are on mapquest_nearby_pins
    (+3.0 / +5.5 / +6.0 / +7.5 across models — exactly the test that
    motivated 274c37a).
  • The largest regression is bluebook_simple on qwen3.5-flash (−2.0)
    — a known rubric-coupled side effect of the same SVG-clickable change.
    Net 274c37a effect across the suite: ~+22 mapquest gains, ~−2 to ~−4
    bluebook costs. Trade-off is real but heavily positive.
  • The please_help_me tool is observed as a soft killswitch in eval mode
    (gmail_vendor_escalation, two models). Recommend a future harness fix to
    auto-reject the call so the agent doesn't stall waiting for a human that
    never comes.

Full root-cause analysis with per-failure entries (F1–F25) lives in
tmp/observation_notes_20260424_100152.md and the rolled-up report at
tmp/OBSERVATION_REPORT_20260424_100152.md on this branch checkout (not
committed; the artifacts are reproducible from the eval command line in
the report header).

The compiler/recorder work itself does not show any regression on the
agent-loop eval — those changes only affect the compile path, not agent
execution.

Routine-compile eval (4 × -fast models, 3 fixtures × 4 = 12 runs)

Per-model canonical reports now in eval/routine_eval/. Pass rates:

| Model | Pass | Notes |
|---|---|---|
| qwen3.5-plus | 1/3 | New fixture: intent_match=1.0 (was 0.4 pre-fix), which confirms the trace_viewer changes reach the LLM. |
| qwen3.6-plus | 2/3 | Best of the four. |
| qwen3.5-flash | 0/3 | Consistent across runs: flash drift on multi-step tasks. |
| qwen3.6-flash | 1/3 | |

The new github-trending-contenteditable-question fixture is genuinely
hard — even when the model picks up the typed instructions correctly
(intent_match=1.0 on qwen3.5-plus), it can still fail on Keywords
placement or asking-behavior. That is by design: the fixture is a
multi-axis stress test of the contenteditable pipeline.

Dependency

  • agent-sdk: pinned to 32e6edba2178eac73afea6d0a3bdf452d621394a on the
    open-browser branch. That commit contains the matching prompt update
    ("feat(compiler): surface long input values and form final_value in
    trace_viewer"). pyproject.toml and uv.lock are updated, and the lock
    matches the pin.

Test plan

  • uv run pytest -q — 499 passed, 4 skipped, 6 warnings
  • npm --prefix extension test — 195 pass / 0 fail / 564 expect() calls
  • Pre-commit (black + prettier + eslint + check-toml + check-yaml) on
    every file touched on this branch
  • Real-recording end-to-end: recorded a live Yuque doc edit on this
    branch; the typed "Write also: 1. A brief intro 2. What's special 3. Why's it trending" is now a first-class chunk of the trace and
    the compiler agent recognises it as agent-investigation prompts on
    its own without manual gate feedback (qwen3.5-plus run, see
    eval/routine_eval/fixtures/github-trending-contenteditable-question/).

🤖 Generated with Claude Code

softpudding and others added 9 commits April 21, 2026 13:27

224d025 feat: separate compiler-agent default LLM from general agent default

The Compiler Agent was falling through to `default_llm_alias`, which is
typically a small/cheap model (qwen3.5-flash) unsuitable for compiling
recordings into routines. Introduce `default_compiler_alias` so operators
can point the compiler at a stronger model (e.g. qwen3.5-plus) without
changing the agent default. Empty/unset falls back to the agent default.

- server: add AppConfig.default_compiler_alias, get_compiler_llm_config(),
  set_default_compiler_alias(); route compiler_agent and /recordings
  compile pre-validation through the new resolver.
- api/config: surface and accept default_compiler_alias; validate against
  submitted aliases before persisting (avoids half-saved state on 400).
- skill/ob-routines: Claude queries /api/config and, if
  default_compiler_alias is unset, picks the best available alias
  (plus > flash, avoid coding endpoint's tighter quota) and passes it
  via --model-alias. Always reports the chosen model to the user.
- frontend: Compiler-default dropdown in the model settings panel
  (— use agent default — + one option per configured alias), synced as
  aliases are added/renamed/removed.
- tests: new test_llm_config_manager covers alias selection, fallback,
  and auto-reset when the configured alias is removed; route tests
  cover POST validation ordering and persistence.
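
The fallback and auto-reset behaviour described above can be sketched as follows. The `SketchConfig` class and the `resolve_compiler_alias` / `remove_alias` helpers are hypothetical stand-ins; only the `default_llm_alias` and `default_compiler_alias` field names come from the commit message, and the real `AppConfig` is likely shaped differently.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the server config."""
    aliases: Dict[str, dict] = field(default_factory=dict)
    default_llm_alias: str = ""
    default_compiler_alias: str = ""


def resolve_compiler_alias(cfg: SketchConfig) -> str:
    """Use the compiler default when set and still configured, else the agent default."""
    alias = cfg.default_compiler_alias
    if alias and alias in cfg.aliases:
        return alias
    return cfg.default_llm_alias


def remove_alias(cfg: SketchConfig, name: str) -> None:
    """Drop an alias and reset the compiler default if it pointed at it."""
    cfg.aliases.pop(name, None)
    if cfg.default_compiler_alias == name:
        cfg.default_compiler_alias = ""


cfg = SketchConfig(
    aliases={"qwen3.5-flash": {}, "qwen3.5-plus": {}},
    default_llm_alias="qwen3.5-flash",
    default_compiler_alias="qwen3.5-plus",
)
print(resolve_compiler_alias(cfg))  # qwen3.5-plus
remove_alias(cfg, "qwen3.5-plus")
print(resolve_compiler_alias(cfg))  # qwen3.5-flash
```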

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

83d27c0 fix(extension): type via CDP Input.dispatchKeyEvent per character

Sites like Zhihu rotate hot-topic text into the search input on a
timer, pausing only when they see real keydown/keyup events. The old
path set `el.value = text` and dispatched a synthetic `input` event;
no keystroke events fired, so the rotator kept ticking and clobbered
typed text during the LLM turn gap — the Zhihu-search-AI bug.

performKeyboardInput now:
 - focuses and clears the editable via JS (keeps existing activation
   helpers for labels and shadow roots)
 - types each character via Input.dispatchKeyEvent, so keydown, input,
   and keyup all flow through Chromium's native input pipeline. A US
   keyboard layout table covers letters, digits, space, and ~30
   punctuation/shifted symbols with correct DOM code and virtual key;
   non-ASCII falls back to `char` insertion.
 - verifies document.activeElement actually landed on (or inside) the
   target before handing off to CDP, failing loudly otherwise.
 - re-runs validateCachedElement during readback so a rerendered
   replacement node surfaces as stale instead of phantom success.
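
To make the per-character payloads concrete, here is a small Python model of the parameters that could be sent to CDP's Input.dispatchKeyEvent for each character. The actual implementation lives in the TypeScript extension and covers about 30 symbols; this sketch maps only a handful, and unlike the description above it falls back to `char` insertion for anything missing from its table, not just non-ASCII. The helper names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

SHIFT = 8  # CDP modifier bit for Shift (Alt=1, Ctrl=2, Meta=4, Shift=8)

# char -> (DOM code, Windows virtual key code, needs Shift) on a US layout
PUNCT: Dict[str, Tuple[str, int, bool]] = {
    " ": ("Space", 32, False),
    ".": ("Period", 190, False),
    "+": ("Equal", 187, True),
    "?": ("Slash", 191, True),
    "@": ("Digit2", 50, True),
}


def lookup(ch: str) -> Optional[Tuple[str, int, bool]]:
    if ch in PUNCT:
        return PUNCT[ch]
    if ch.isascii() and ch.isalpha():
        return (f"Key{ch.upper()}", ord(ch.upper()), ch.isupper())
    if ch.isascii() and ch.isdigit():
        return (f"Digit{ch}", ord(ch), False)
    return None


def key_events_for_char(ch: str) -> List[dict]:
    """Return the dispatchKeyEvent parameter dicts for one character."""
    entry = lookup(ch)
    if entry is None:
        # Unmapped character: insert it as text without key events.
        return [{"type": "char", "text": ch}]
    code, vk, shift = entry
    base = {
        "key": ch,
        "code": code,
        "windowsVirtualKeyCode": vk,
        "modifiers": SHIFT if shift else 0,
    }
    # keyDown carries the text so the browser generates the input event.
    return [{"type": "keyDown", "text": ch, **base}, {"type": "keyUp", **base}]


for event in key_events_for_char("A"):
    print(event)
```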

Verified end-to-end against Zhihu (search "AI" now submits "AI") and
DuckDuckGo with punctuation ("C++ @2026.04 vs. Rust?" round-trips
through the URL unchanged). Also re-ran the full 4-model eval (140
runs): qwen3.6-flash gained +10.8 task points and ran ~30s faster
per test, consistent with no longer losing time to rotator clobber.

274c37a fix(highlight): detect SVG graphics elements as clickable

Map pins (SVG <circle>/<rect> children of <g> inside <svg>), icon
toggles drawn directly in SVG, and chart markers can have their own
cursor:pointer and click listener without an HTML wrapper. The prior
detection pipeline dropped them on two gates:

1. isMeaningfulPointerCandidate rejected non-HTMLElements outright,
   so the pointer-cursor signal never registered for SVG leaves.
2. resolveClickableCandidate walked from an SVG element to its
   parentElement and bailed when that parent was also an SVG (as it
   always is for pins inside <g>-wrappers).

Fix:
- Accept SVGElement in isMeaningfulPointerCandidate; the size/area
  heuristics below already work on SVG bboxes (the
  SVGGraphicsElement.prototype.getBoundingClientRect patch at the top
  of the scan makes layout reads fast and consistent).
- When resolveClickableCandidate is handed an SVG graphics element
  that classifies as clickable on its own, return it directly as a
  standalone candidate. Fall through to the existing HTML-ancestor
  walk only for decorative SVG children of interactive HTML wrappers
  (<button><svg>…</svg></button>) — behaviour unchanged for that case.

Surfaced by the mapquest_nearby_pins evaluation test: pins rendered
as <circle class="map-pin"> with cursor:pointer + addEventListener
click were never picked up in the highlight scan, so the agent had
no element_id to target. All 4 qwen models scored 4.5/12 on that
test both on main and on branch, a stable 4.5-point floor that the
pin-detection gap explained.

Verified end-to-end via open-browser skill against the mapquest
mock site (served with the shared /js/tracker.js dependency so the
page actually renders pins): 8 pins detected, agent clicks the
Space Needle pin and the place-detail panel opens with name, rating,
hours, address, website, phone. Existing highlight-detection /
highlight-any / element-actions-regression suites all green.

3ff942b chore: remove stray local_vendor/ directory

The single file under local_vendor/openhands-sdk/ was unreferenced by
pyproject.toml, uv.lock, or any source — the SDK is consumed via the uv
git source, not a local vendor tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inal keyframe on stop

Two independent wins for the compiler-agent's trace view, both surfaced
by recording aab8711b on the latest session:

1. Merge consecutive keystrokes.

   The recorder emits one `input` event per keystroke (plus
   `beforeinput` on contenteditable rich editors), so a 10-letter title
   produces 10+ near-identical events. On a real Yuque recording, 123 of
   the 266 events in the trace were input events, so every actual action
   (click, navigation, drag) was buried in typing noise.

   New helper `coalesce_typing_events` in workflow_compiler walks events
   in order; runs of consecutive `input`/`change`/`beforeinput` on the
   same element identity collapse into the last event in the run. The
   survivor carries the final text snapshot and picks up
   `coalescedCount` + `coalescedEventIndexes` annotations so the agent
   can drill back into any single keystroke via `event_detail`.

   Identity uses a new `_stable_element_identity` (selector + ARIA label
   + placeholder + container selector) — the existing `_element_identity`
   folds `element.text` into the hash, which is exactly what changes
   between keystrokes and defeats coalescing.

   `TraceViewerExecutor` now applies the coalescer in its constructor
   and presents the folded list via `events`/`summary`; `_events_by_index`
   still indexes the full raw list so `event_detail` works on absorbed
   indexes. `summary` calls out the raw→coalesced delta and each listed
   event gets a `[coalesced ×N]` tag when more than one was folded.

   Smoke-tested against recording 589eb0e8 (266 events, 123 inputs):
   after coalescing, 7 input events remain (one per typing burst on one
   element); non-typing events are untouched.

2. Capture a final keyframe on `recording_stopped`.

   `stopRecording` picks the scope's currently-active tab (falling back
   to any recordable tab in scope) and calls `buildRecordingKeyframe`
   before the debugger session is torn down. The keyframe rides on the
   `recording_stopped` event's `event_data.keyframe` slot, so the
   existing trace viewer keyframe-count + `event_detail` image display
   works without further changes. Failures are logged and swallowed —
   stop must never block on screenshot flakiness.
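
The keystroke merging from item 1 can be sketched as below. This is a simplified model, not the real `coalesce_typing_events`: the event and element field names other than `coalescedCount` and `coalescedEventIndexes` are assumptions, and keyframe promotion is left out.

```python
from typing import Any, Dict, List, Tuple

TYPING_TYPES = {"input", "change", "beforeinput"}


def stable_identity(element: Dict[str, Any]) -> Tuple:
    """Identity that ignores element text, which changes on every keystroke."""
    return (
        element.get("selector"),
        element.get("ariaLabel"),
        element.get("placeholder"),
        element.get("containerSelector"),
    )


def coalesce_typing_events(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fold runs of consecutive typing events on one element into the last one."""
    out: List[Dict[str, Any]] = []
    run: List[Tuple[int, Dict[str, Any]]] = []  # (raw index, event)

    def flush() -> None:
        if not run:
            return
        survivor = dict(run[-1][1])  # last event holds the final text snapshot
        if len(run) > 1:
            survivor["coalescedCount"] = len(run)
            survivor["coalescedEventIndexes"] = [i for i, _ in run]
        out.append(survivor)
        run.clear()

    for index, event in enumerate(events):
        if event.get("type") in TYPING_TYPES:
            identity = stable_identity(event.get("element", {}))
            if run and stable_identity(run[-1][1].get("element", {})) != identity:
                flush()  # typing moved to a different element
            run.append((index, event))
        else:
            flush()  # any non-typing event ends the current run
            out.append(event)
    flush()
    return out
```

Keeping the raw indexes on the survivor is what allows a lookup by original index to keep working for absorbed events.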

Tests: new `test_coalesce_typing_events.py` (8 cases covering folding,
run separation, keyframe promotion, order preservation); existing
`test_workflow_compiler_contenteditable.py` still green; recorder
bun-test suite still passes with the new stop-time keyframe call
(gracefully no-ops when Chrome debugger API isn't available in the
harness).

…t in trace_viewer

Three layered fixes so that text typed into rich-text editor bodies (Yuque/
Lake editor and similar) reaches the compiler agent, plus an eval fixture
pinning the regression and SKILL.md improvements that surfaced from running
the recording→compile pipeline end-to-end.

Recorder (extension/src/content/index.ts):
- input listener now also matches contenteditable targets, with a new
  isContentEditableElement helper and a getContentEditableText helper that
  populates the serialized value from innerText.
- New beforeinput listener for contenteditable targets only — covers rich
  editors (Lake, ProseMirror, Slate, Lexical, TipTap) that intercept
  keydown + preventDefault and synthesize edits via their own DOM model
  so native input events never fire on the body. The DOM snapshot is
  deferred to a microtask so it reflects the post-mutation state.

Compiler view (server/core/compiler_agent.py):
- _format_value_with_tail renders long input values with both ends
  visible and the middle elided as "<head> ...(N more chars; use
  event_detail)... <tail>". Replaces the hard 80-char head-only
  truncation that hid user-typed instructions appearing late in the
  value.
- _handle_normalized_steps now surfaces a "field <selector>
  final_value=..." line per anchor for [form] steps, picking the
  latest snapshot in event order. The previous summary showed only
  step type and event indexes, hiding the actual typed content.
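
A minimal sketch of the latest-in-event-order picker follows, contrasted with a longest-wins heuristic. The event shape and helper names are hypothetical; only the `field <selector> final_value=...` line format is taken from the description above.

```python
from typing import Any, Dict, List


def latest_values(events: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each selector to the value from the last event that touched it."""
    latest: Dict[str, str] = {}
    for event in events:  # assumed to be in event order already
        element = event.get("element", {})
        selector = element.get("selector")
        if selector and "value" in element:
            latest[selector] = element["value"]
    return latest


def longest_values(events: List[Dict[str, Any]]) -> Dict[str, str]:
    """Longest-wins heuristic, shown only for comparison."""
    longest: Dict[str, str] = {}
    for event in events:
        element = event.get("element", {})
        selector = element.get("selector")
        value = element.get("value")
        if selector and value is not None and len(value) >= len(longest.get(selector, "")):
            longest[selector] = value
    return longest


def render_final_values(values: Dict[str, str]) -> List[str]:
    return [f'field {selector} final_value="{value}"' for selector, value in values.items()]


# Paste-then-trim: the user pastes a long URL, then trims it down.
events = [
    {"type": "input", "element": {"selector": "#body", "value": "https://example.com/a/very/long/path"}},
    {"type": "input", "element": {"selector": "#body", "value": "example.com"}},
]
print(render_final_values(latest_values(events)))   # shows the trimmed text
print(render_final_values(longest_values(events)))  # shows the stale pasted URL
```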

Eval fixture (eval/routine_eval/fixtures/github-trending-contenteditable-question/):
- Real recording (5c5cf4f5) where the user types instructions into the
  Yuque body for the replay agent to follow. expectations.yaml encodes
  the position-vs-identity ambiguity and forbids asking the user to
  retype visible content while leaving intent-clarification questions
  legitimate.

Tests (server/tests/unit/):
- test_workflow_compiler_contenteditable.py covers the html-fallback
  path in _extract_input_value (introduced earlier in this branch).
- test_compiler_agent_value_view.py covers _format_value_with_tail
  edge cases and the latest-in-event-order field picker, including
  paste-then-trim / clear-and-rewrite flows where longest-wins would
  surface stale text.

SKILL.md (skill/claude/ob-routines/SKILL.md):
- tmux launch keeps the window alive via "exec zsh" so [compiler:saved]
  and [compile-done] markers don't get lost when the window auto-closes
  on python exit.
- Monitor template detects pane-gone and emits a terminal event so a
  silent dead-pane poll loop is no longer possible.
- Adds a verify-after-saved step using list_routines.py.
- Quality-gate section now states explicitly that the gate reasoning
  is Claude's judgment, must be written as user-visible text before
  pressing Enter, and the compiler's wrap-up message is not a
  substitute.

Pin agent-sdk to commit 32e6edba (matching prompt update for the new
trace_viewer rendering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Full eval on feat/compiler-default-alias with qwen3.{5,6}-{plus,flash}-fast.
105/140 passed (75.0%), raw score delta −21.8 vs main (−1.8%); infra-adjusted
−4.8 (−0.4%). See tmp/OBSERVATION_REPORT_20260424_100152.md for full
root-cause analysis (T1 SVG-clickable trade-off, T2 please_help_me eval
killswitch, flash instruction drift, etc).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…alias

The canonical version-controlled report at
`eval/routine_eval/compile_evaluation_report.json` is overwritten by every
run, so a multi-model eval loop (4 -fast models) ends up keeping only the
last run's data — which happened to be the weakest model and looked like
"all cases failed". When `--compile-alias` is given, write the canonical
copy to `compile_evaluation_report_<alias>.json` instead so each model in a
loop preserves its own baseline. The unsuffixed path is preserved as the
default for runs that use the server's default alias, so existing dashboards
and CI flows are unaffected.
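
The naming rule above amounts to something like this sketch. The helper name and its parameters are hypothetical; the two file names are the ones given in the commit message.

```python
from pathlib import Path
from typing import Optional


def canonical_report_path(report_dir: str, compile_alias: Optional[str] = None) -> Path:
    """Pick the canonical report file, suffixed by alias when one is given."""
    if compile_alias:
        name = f"compile_evaluation_report_{compile_alias}.json"
    else:
        # No alias: keep the unsuffixed default path.
        name = "compile_evaluation_report.json"
    return Path(report_dir) / name


print(canonical_report_path("eval/routine_eval", "qwen35plus-fast").name)
# compile_evaluation_report_qwen35plus-fast.json
print(canonical_report_path("eval/routine_eval").name)
# compile_evaluation_report.json
```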

Also commits the four per-model reports from a fresh rerun on 2026-04-24:

  qwen35plus-fast  : 1/3 pass  (intent_match peaks at 1.0 on the new
                                contenteditable fixture, confirming the
                                trace_viewer fixes from 8f11aa6 reach the LLM)
  qwen36plus-fast  : 2/3 pass
  qwen35flash-fast : 0/3 pass  (consistent across runs — flash drift)
  qwen36flash-fast : 1/3 pass

Per-test failure breakdown is in tmp/observation_notes_20260424_100152.md
plus the chat record from the rerun.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-commit autoformatted black (Python) and prettier (TS) over the files
touched on this branch. No semantic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

softpudding merged commit d12340a into main on Apr 25, 2026
4 checks passed